{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Planning Algorithms"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Do you remember on lesson 2 and 3 we discussed algorithms that basically solve MDPs? That is, find a policy given a exact representation of the environment. In this section, we will explore 2 such algorithms. Value Iteration and policy iteration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import tempfile\n",
    "import pprint\n",
    "import json\n",
    "import sys\n",
    "import gym\n",
    "\n",
    "from gym import wrappers\n",
    "from subprocess import check_output\n",
    "from IPython.display import HTML"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Value Iteration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Value Iteration algorithm uses dynamic programming by dividing the problem into common sub-problems and leveraging that optimal structure to speed-up computations.\n",
    "\n",
    "Let me show you how value iterations look like:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def value_iteration(S, A, P, gamma=.99, theta = 0.0000001):\n",
    " \n",
    "    V = np.random.random(len(S))\n",
    "    for i in range(100000):\n",
    "        old_V = V.copy()\n",
    "        \n",
    "        Q = np.zeros((len(S), len(A)), dtype=float)\n",
    "        for s in S:\n",
    "            for a in A:\n",
    "                for prob, s_prime, reward, done in P[s][a]:\n",
    "                    Q[s][a] += prob * (reward + gamma * old_V[s_prime] * (not done))\n",
    "            V[s] = Q[s].max()\n",
    "        if np.all(np.abs(old_V - V) < theta):\n",
    "            break\n",
    "    \n",
    "    pi = np.argmax(Q, axis=1)\n",
    "    return pi, V"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, value iteration expects a set of states, e.g. (0,1,2,3,4) a set of actions, e.g. (0,1) and a set of transition probabilities that represent the dynamics of the environment. Let's take a look at these variables:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[2017-08-27 08:15:35,098] Making new env: FrozenLake-v0\n"
     ]
    }
   ],
   "source": [
    "mdir = tempfile.mkdtemp()\n",
    "env = gym.make('FrozenLake-v0')\n",
    "env = wrappers.Monitor(env, mdir, force=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "S = range(env.env.observation_space.n)\n",
    "A = range(env.env.action_space.n)\n",
    "P = env.env.env.P"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "range(0, 16)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "S"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "range(0, 4)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{0: [(0.3333333333333333, 6, 0.0, False),\n",
       "  (0.3333333333333333, 9, 0.0, False),\n",
       "  (0.3333333333333333, 14, 0.0, False)],\n",
       " 1: [(0.3333333333333333, 9, 0.0, False),\n",
       "  (0.3333333333333333, 14, 0.0, False),\n",
       "  (0.3333333333333333, 11, 0.0, True)],\n",
       " 2: [(0.3333333333333333, 14, 0.0, False),\n",
       "  (0.3333333333333333, 11, 0.0, True),\n",
       "  (0.3333333333333333, 6, 0.0, False)],\n",
       " 3: [(0.3333333333333333, 11, 0.0, True),\n",
       "  (0.3333333333333333, 6, 0.0, False),\n",
       "  (0.3333333333333333, 9, 0.0, False)]}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "P[10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You see the world we are looking into \"FrozenLake-v0\" has 16 different states, 4 different actions. The `P[10]` is basically showing us a peek into the dynamics of the world. For example, in this case, if you are in state \"10\" (from `P[10]`) and you take action 0 (see dictionary key 0), you have equal probability of 0.3333 to land in either state 6, 9 or 14. None of those transitions give you any reward and none of them is terminal.\n",
    "\n",
    "In contrast, we can see taking action 2, might transition you to state 11, which **is** terminal. \n",
    "\n",
    "Get the hang of it? Let's run it!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pi, V = value_iteration(S, A, P)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, value iteration calculates two important things. First, it calculates `V`, which tells us how much should we expect from each state if we always act optimally. Second, it gives us `pi`, which is the optimal policy given `V`. Let's take a deeper look:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0.54202426,  0.49880096,  0.47069307,  0.45684887,  0.55844941,\n",
       "        0.        ,  0.35834688,  0.        ,  0.59179743,  0.64307884,\n",
       "        0.61520669,  0.        ,  0.        ,  0.74171974,  0.86283707,  0.        ])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "V"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 3, 3, 3, 0, 0, 0, 0, 3, 1, 0, 0, 0, 2, 1, 0])"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "See? This policy basically says in state `0`, take action `0`. In state `1` take action `3`. In state `2` take action `3` and so on. Got it?\n",
    "\n",
    "Now, we have the \"directions\" or this \"map\". With this, we can just use this policy and solve the environment as we interact with it.\n",
    "\n",
    "Let's try it out!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[2017-08-27 08:16:30,009] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000000.json\n",
      "[2017-08-27 08:16:30,015] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000001.json\n",
      "[2017-08-27 08:16:30,024] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000008.json\n",
      "[2017-08-27 08:16:30,063] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000027.json\n",
      "[2017-08-27 08:16:30,116] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000064.json\n",
      "[2017-08-27 08:16:30,168] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000125.json\n",
      "[2017-08-27 08:16:30,245] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000216.json\n",
      "[2017-08-27 08:16:30,346] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000343.json\n",
      "[2017-08-27 08:16:30,461] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000512.json\n",
      "[2017-08-27 08:16:30,613] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video000729.json\n",
      "[2017-08-27 08:16:30,796] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video001000.json\n",
      "[2017-08-27 08:16:31,510] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video002000.json\n",
      "[2017-08-27 08:16:32,407] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video003000.json\n",
      "[2017-08-27 08:16:33,056] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video004000.json\n",
      "[2017-08-27 08:16:33,717] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video005000.json\n",
      "[2017-08-27 08:16:34,350] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video006000.json\n",
      "[2017-08-27 08:16:34,995] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video007000.json\n",
      "[2017-08-27 08:16:35,629] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video008000.json\n",
      "[2017-08-27 08:16:36,299] Starting new video recorder writing to /tmp/tmp8ebvhkul/openaigym.video.0.3760.video009000.json\n"
     ]
    }
   ],
   "source": [
    "for _ in range(10000):\n",
    "    state = env.reset()\n",
    "    while True:\n",
    "        state, reward, done, info = env.step(pi[state])\n",
    "        if done:\n",
    "            break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That was the agent interacting with the environment. Let's take a look at some of the episodes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://asciinema.org/a/cJ4n5wZKQJIxjwKpndi0OKmWX\n"
     ]
    }
   ],
   "source": [
    "last_video = env.videos[-1][0]\n",
    "out = check_output([\"asciinema\", \"upload\", last_video])\n",
    "out = out.decode(\"utf-8\").replace('\\n', '').replace('\\r', '')\n",
    "print(out)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can look on that link, or better, let's show it on the notebook:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<script type=\"text/javascript\" \n",
       "    src=\"https://asciinema.org/a/cJ4n5wZKQJIxjwKpndi0OKmWX.js\" \n",
       "    id=\"asciicast-cJ4n5wZKQJIxjwKpndi0OKmWX\" \n",
       "    async data-autoplay=\"true\" data-size=\"big\">\n",
       "</script>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "castid = out.split('/')[-1]\n",
    "html_tag = \"\"\"\n",
    "<script type=\"text/javascript\" \n",
    "    src=\"https://asciinema.org/a/{0}.js\" \n",
    "    id=\"asciicast-{0}\" \n",
    "    async data-autoplay=\"true\" data-size=\"big\">\n",
    "</script>\n",
    "\"\"\"\n",
    "html_tag = html_tag.format(castid)\n",
    "HTML(data=html_tag)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Interesting right? Did you get the world yet?\n",
    "\n",
    "So, 'S' is the starting state, 'G' the goal. 'F' are Frozen grids, and 'H' are holes. Your goal is to go from S to G without falling into any H. The problem is, F is slippery so, often times you are better of by trying moves that seems counter-intuitive. But because you are preventing falling on 'H's it makes sense in the end. For example, the second row, first column 'F', you can see how our agent was trying so hard to go left!! Smashing his head against the wall?? Silly. But why?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{0: [(0.3333333333333333, 0, 0.0, False),\n",
       "  (0.3333333333333333, 4, 0.0, False),\n",
       "  (0.3333333333333333, 8, 0.0, False)],\n",
       " 1: [(0.3333333333333333, 4, 0.0, False),\n",
       "  (0.3333333333333333, 8, 0.0, False),\n",
       "  (0.3333333333333333, 5, 0.0, True)],\n",
       " 2: [(0.3333333333333333, 8, 0.0, False),\n",
       "  (0.3333333333333333, 5, 0.0, True),\n",
       "  (0.3333333333333333, 0, 0.0, False)],\n",
       " 3: [(0.3333333333333333, 5, 0.0, True),\n",
       "  (0.3333333333333333, 0, 0.0, False),\n",
       "  (0.3333333333333333, 4, 0.0, False)]}"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "P[4]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "See how action 0 (left) doesn't have any transition leading to a terminal state??\n",
    "\n",
    "All other actions give you a 0.333333 chance each of pushing you into the hole in state '5'!!! So it actually makes sense to go left until it slips you downward to state 8.\n",
    "\n",
    "Cool right?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 3, 3, 3, 0, 0, 0, 0, 3, 1, 0, 0, 0, 2, 1, 0])"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "See how the \"prescribed\" action is 0 (left) on the policy calculated by value iteration?\n",
    "\n",
    "How about the values?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0.54202426,  0.49880096,  0.47069307,  0.45684887,  0.55844941,\n",
       "        0.        ,  0.35834688,  0.        ,  0.59179743,  0.64307884,\n",
       "        0.61520669,  0.        ,  0.        ,  0.74171974,  0.86283707,  0.        ])"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "V"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These show the expected rewards on each state."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{0: [(1.0, 15, 0, True)],\n",
       " 1: [(1.0, 15, 0, True)],\n",
       " 2: [(1.0, 15, 0, True)],\n",
       " 3: [(1.0, 15, 0, True)]}"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "P[15]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "See how the state '15' gives you a reward of +1?? These signal gets propagated all the way to the start state using Value Iteration and it shows the values all accross.\n",
    "\n",
    "Cool? Good."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[2017-08-27 08:18:16,163] Finished writing results. You can upload them to the scoreboard via gym.upload('/tmp/tmp8ebvhkul')\n"
     ]
    }
   ],
   "source": [
    "env.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want to submit to OpenAI Gym, get your API Key and paste it here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[2017-04-01 16:55:43,229] [FrozenLake-v0] Uploading 10000 episodes of training data\n",
      "[2017-04-01 16:55:44,905] [FrozenLake-v0] Uploading videos of 19 training episodes (2158 bytes)\n",
      "[2017-04-01 16:55:45,131] [FrozenLake-v0] Creating evaluation object from /tmp/tmpfukeltbz with learning curve and training video\n",
      "[2017-04-01 16:55:45,620] \n",
      "****************************************************\n",
      "You successfully uploaded your evaluation on FrozenLake-v0 to\n",
      "OpenAI Gym! You can find it at:\n",
      "\n",
      "    https://gym.openai.com/evaluations/eval_ycTPCbyiTWK6T0C4DyrvRg\n",
      "\n",
      "****************************************************\n"
     ]
    }
   ],
   "source": [
    "gym.upload(mdir, api_key='<YOUR OPENAI API KEY>')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Policy Iteration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is another method called policy iteration. This method is composed of 2 other methods, policy evaluation and policy improvement. The logic goes that policy iteration is 'evaluating' a policy to check for convergence (meaning the policy doesn't change), and 'improving' the policy, which is applying something similar to a 1 step value iteration to get a slightly better policy, but definitely not worse.\n",
    "\n",
    "These two functions cycling together are what policy iteration is about.\n",
    "\n",
    "Can you implement this algorithm yourself? Try it. Make sure to look the solution notebook in case you get stuck.\n",
    "\n",
    "I will give you the policy evaluation and policy improvement methods, you build the policy iteration cycling between the evaluation and improvement methods until there are no changes to the policy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def policy_evaluation(pi, S, A, P, gamma=.99, theta=0.0000001):\n",
    "    \n",
    "    V = np.zeros(len(S))\n",
    "    while True:\n",
    "        delta = 0\n",
    "        for s in S:\n",
    "            v = V[s]\n",
    "            \n",
    "            V[s] = 0\n",
    "            for prob, dst, reward, done in P[s][pi[s]]:\n",
    "                V[s] += prob * (reward + gamma * V[dst] * (not done))\n",
    "            \n",
    "            delta = max(delta, np.abs(v - V[s]))\n",
    "        if delta < theta:\n",
    "            break\n",
    "    return V"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "def policy_improvement(pi, V, S, A, P, gamma=.99):\n",
    "    for s in S:\n",
    "        old_a = pi[s]\n",
    "        \n",
    "        Qs = np.zeros(len(A), dtype=float)\n",
    "        for a in A:\n",
    "            for prob, s_prime, reward, done in P[s][a]:\n",
    "                Qs[a] += prob * (reward + gamma * V[s_prime] * (not done))\n",
    "        pi[s] = np.argmax(Qs)\n",
    "        V[s] = np.max(Qs)\n",
    "    return pi, V"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "def policy_iteration(S, A, P, gamma=.99):\n",
    "    pi = np.random.choice(A, len(S))\n",
    "    while True:    \n",
    "        V = policy_evaluation(pi, S, A, P, gamma)\n",
    "        new_pi, new_V = policy_improvement(\n",
    "            pi.copy(), V.copy(), S, A, P, gamma)\n",
    "        if np.all(pi == new_pi):\n",
    "            break\n",
    "        pi = new_pi\n",
    "        V = new_V\n",
    "    return pi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After you implement the algorithms, you can run it and calculate the optimal policy:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[2017-08-27 08:21:56,074] Making new env: FrozenLake-v0\n",
      "[2017-08-27 08:21:56,078] Finished writing results. You can upload them to the scoreboard via gym.upload('/tmp/tmpsqiqif_m')\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1 3 0 3 0 0 0 0 3 1 0 0 0 2 1 0]\n"
     ]
    }
   ],
   "source": [
    "mdir = tempfile.mkdtemp()\n",
    "env = gym.make('FrozenLake-v0')\n",
    "env = wrappers.Monitor(env, mdir, force=True)\n",
    "\n",
    "S = range(env.env.observation_space.n)\n",
    "A = range(env.env.action_space.n)\n",
    "P = env.env.env.P\n",
    "\n",
    "pi = policy_iteration(S, A, P)\n",
    "print(pi)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And, of course, interact with the environment looking at the \"directions\" or \"policy\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[2017-08-27 08:21:59,041] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000000.json\n",
      "[2017-08-27 08:21:59,053] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000001.json\n",
      "[2017-08-27 08:21:59,059] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000008.json\n",
      "[2017-08-27 08:21:59,086] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000027.json\n",
      "[2017-08-27 08:21:59,127] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000064.json\n",
      "[2017-08-27 08:21:59,166] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000125.json\n",
      "[2017-08-27 08:21:59,214] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000216.json\n",
      "[2017-08-27 08:21:59,287] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000343.json\n",
      "[2017-08-27 08:21:59,375] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000512.json\n",
      "[2017-08-27 08:21:59,490] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video000729.json\n",
      "[2017-08-27 08:21:59,624] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video001000.json\n",
      "[2017-08-27 08:22:00,092] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video002000.json\n",
      "[2017-08-27 08:22:00,837] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video003000.json\n",
      "[2017-08-27 08:22:01,269] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video004000.json\n",
      "[2017-08-27 08:22:01,720] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video005000.json\n",
      "[2017-08-27 08:22:02,184] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video006000.json\n",
      "[2017-08-27 08:22:02,614] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video007000.json\n",
      "[2017-08-27 08:22:03,085] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video008000.json\n",
      "[2017-08-27 08:22:03,518] Starting new video recorder writing to /tmp/tmpqfn2e0ho/openaigym.video.4.3760.video009000.json\n"
     ]
    }
   ],
   "source": [
    "for _ in range(10000):\n",
    "    state = env.reset()\n",
    "    while True:\n",
    "        state, reward, done, info = env.step(pi[state])\n",
    "        if done:\n",
    "            break"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://asciinema.org/a/NIeAt9sdjwkvmQbZSOVAkJOIb\n"
     ]
    }
   ],
   "source": [
    "last_video = env.videos[-1][0]\n",
    "out = check_output([\"asciinema\", \"upload\", last_video])\n",
    "out = out.decode(\"utf-8\").replace('\\n', '').replace('\\r', '')\n",
    "print(out)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "<script type=\"text/javascript\" \n",
       "    src=\"https://asciinema.org/a/NIeAt9sdjwkvmQbZSOVAkJOIb.js\" \n",
       "    id=\"asciicast-NIeAt9sdjwkvmQbZSOVAkJOIb\" \n",
       "    async data-autoplay=\"true\" data-size=\"big\">\n",
       "</script>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "castid = out.split('/')[-1]\n",
    "html_tag = \"\"\"\n",
    "<script type=\"text/javascript\" \n",
    "    src=\"https://asciinema.org/a/{0}.js\" \n",
    "    id=\"asciicast-{0}\" \n",
    "    async data-autoplay=\"true\" data-size=\"big\">\n",
    "</script>\n",
    "\"\"\"\n",
    "html_tag = html_tag.format(castid)\n",
    "HTML(data=html_tag)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similar as before. Policies could be slightly different if there is a state in which more than one action give the same value in the end."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0.54202426,  0.49880096,  0.47069307,  0.45684887,  0.55844941,\n",
       "        0.        ,  0.35834688,  0.        ,  0.59179743,  0.64307884,\n",
       "        0.61520669,  0.        ,  0.        ,  0.74171974,  0.86283707,  0.        ])"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "V"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 3, 0, 3, 0, 0, 0, 0, 3, 1, 0, 0, 0, 2, 1, 0])"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's it let's wrap up."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[2017-08-27 08:22:29,264] Finished writing results. You can upload them to the scoreboard via gym.upload('/tmp/tmpqfn2e0ho')\n"
     ]
    }
   ],
   "source": [
    "env.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want to submit to OpenAI Gym, get your API Key and paste it here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 134,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[2017-04-01 20:40:54,103] [FrozenLake-v0] Uploading 10000 episodes of training data\n",
      "[2017-04-01 20:40:55,854] [FrozenLake-v0] Uploading videos of 19 training episodes (2278 bytes)\n",
      "[2017-04-01 20:40:56,102] [FrozenLake-v0] Creating evaluation object from /tmp/tmpyspcx0sa with learning curve and training video\n",
      "[2017-04-01 20:40:56,451] \n",
      "****************************************************\n",
      "You successfully uploaded your evaluation on FrozenLake-v0 to\n",
      "OpenAI Gym! You can find it at:\n",
      "\n",
      "    https://gym.openai.com/evaluations/eval_vAvbhsGQRVSAe5DZkFNrQ\n",
      "\n",
      "****************************************************\n"
     ]
    }
   ],
   "source": [
    "gym.upload(mdir, api_key='<YOUR OPENAI API KEY>')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Hope you liked it... Value Iteration and Policy Iteration might seem disappointing at first, and I understand. What is the intelligence on following given directions!? What if you just don't have a map of the environment you are interacting with? Come on, that's not AI. You are right, it is not. However, Value Iteration and Policy Iteration form the basis of 2 of the 3 most fundamental paradigms of algorithms in reinforcement learning.\n",
    "\n",
    "Next notebooks we start looking into slightly more complicated environment. And also, we will learn about algorithms that learn while interacting with the environment. Also called \"online\" learning."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
